Spotting the target: microarrays for disease gene discovery 

Paul S Meltzer 



Microarray technologies enable genome-scale expression 
measurements. Already proved to be of value for the functional 
analysis of individual genes and biological processes, the 
application of expression profiling to disease gene discovery is 
now growing in importance and practicality. 

Abbreviations 

CGH comparative genomic hybridization 
NF2 neurofibromatosis type 2 
PARP poly(ADP-ribose) polymerase 
TD Tangier disease 

Introduction 

Positional cloning projects have been greatly facilitated by 
the availability of increasingly precise maps and sequence 
databases for diverse species. This same avalanche of 
genomic data has inspired an intense effort to study 
aspects of genome function in a high-throughput fashion. 
The parallel analysis of gene expression has emerged as 
one of the most productive embodiments of this approach. 

Practical technologies for large-scale gene-expression 
analysis are now being widely implemented. Microarrays 
comprising either oligonucleotides or cDNA fragments 
representing thousands of genes are well suited to the 
analysis of multiple samples [1,2]. To obtain genome-scale 
expression data, mRNA from the source of interest is con- 
verted to an appropriately labeled form and hybridized to 
the microarray. Both radioactive and fluorescence-detec- 
tion strategies are in use to measure the resulting 
hybridization signal. The raw data (an image 
obtained from a fluorescence scanner or phosphorimager) 
are processed with computer software to generate a 
spreadsheet of gene-expression values. The application of 
statistical techniques to microarray data allows classifica- 
tion and class discovery within a group of samples, and 
clustering of genes according to their pattern of expression. 

Microarrays have been successfully applied to characterize 
biological processes and to dissect pathways downstream of 
a particular gene of interest. Studies in the yeast 
Saccharomyces cerevisiae, with its relatively small genome and 
highly tractable genetics, have led the way and continue 
with recent reports on signal transduction [3], meiosis [4] 
and transcript localization [5]. Despite the challenges posed 
by their genome sizes, large-scale expression analysis in 
mammals is also becoming increasingly productive. 



As the technology for microarray analysis has matured and 
disseminated, new applications continue to be developed. 
One frequently discussed area is the potential use of 
microarray expression analysis in projects to positionally 
clone and discover disease genes. Although reviews of this 
topic outnumber reports of concrete achievement, it is 
appropriate to examine the state of the art and to consider 
how microarray analysis might accelerate these types of 
research. I discuss these points, together with recent 
developments in microarray research, in this review. 

How might microarrays help find hereditary 
disease genes? 

Several major approaches to locating hereditary disease 
genes might be imagined. In the simplest case, the target 
gene of interest might be identified directly by characteristic 
changes in expression level across a series of samples. 
Alternatively, statistical analysis of microarray data might 
aid gene discovery by revealing pathways related to the target 
gene and facilitating identification of candidate genes. 

Microarrays can also be used to analyze genomic DNA 
rather than mRNA. This is illustrated by the special case of 
copy-number change in cancer, where it is possible to use 
array-format comparative genomic hybridization (CGH) to 
define genes associated with cancer progression [6**,7,8*]. 
In CGH, gene copy number is measured in a DNA sample 
labeled with one fluorochrome by comparison to the signal 
obtained by simultaneous hybridization of normal DNA 
labeled with a second fluorochrome. In principle, copy- 
number data can be linked to expression data to define a 
list of candidate target genes associated with gain of chro- 
mosomal regions [9*, 10]. Although there is no example to 
date, tumor suppressors might be mapped by linking loss of 
gene expression to regions of deletion in tumors. 
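
The two-color comparison described above reduces, for each arrayed clone, to a ratio of test signal to reference signal. The following Python fragment is a minimal sketch of that calculation. The clone names, the intensities and the gain and loss thresholds are all hypothetical values chosen for illustration and are not taken from the studies cited.

```python
import math


def log2_ratio(test_signal, reference_signal):
    """Return log2(test / reference) for one arrayed clone."""
    return math.log2(test_signal / reference_signal)


def call_copy_number(ratio, gain=0.3, loss=-0.3):
    """Classify a log2 ratio; the thresholds are illustrative assumptions."""
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "normal"


# Hypothetical background-corrected intensities: (test DNA, normal reference DNA)
spots = {
    "clone_A": (2400.0, 1200.0),
    "clone_B": (1000.0, 1000.0),
    "clone_C": (450.0, 900.0),
}

calls = {name: call_copy_number(log2_ratio(t, r)) for name, (t, r) in spots.items()}
print(calls)
```

In practice the ratios would first be normalized across the array, and thresholds would be set from the observed variation among clones known to be present at normal copy number.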

Of course, microarrays can be used as sophisticated dot 
blots to screen arrays of clones isolated with techniques 
such as RDA [11]. (RDA [representational difference 
analysis] is a PCR-based subtraction technique that can be 
used to isolate DNA fragments that vary in abundance 
between two sources.) Stephan et al. [12] have identified 
exons of the Niemann-Pick type C disease gene from 
arrayed genomic sequences using mRNA from cells differ- 
entially expressing NPC1. Finally, genes might be linked 
to specific phenotypes, particularly in yeast, through 
methods that allow genome-wide mutational screens using 
microarrays as a readout [13]. 

Finding the best candidate 

It is enticing to hope that analysis of microarray data might 
lead to the direct identification of disease genes. Ideally, one 
would compare a group of samples of varying genotype and 
identify good candidate genes by their pattern of gene 
expression. The expected signature of a mutant gene is 
reduced expression level in samples with the abnormal 
allele. For this strategy to work, the mutant allele would have 
to be either deleted or result in a poorly expressed transcript. 

Fortunately, the phenomenon of nonsense-mediated 
decay of mRNA gives some reason to hope that this result 
might actually be achieved. Nonsense-mediated decay 
(reviewed in [14]) results in the degradation of certain 
mRNAs containing premature termination codons. This 
phenomenon has been observed in a number of disease 
genes [15,16]. 

In addition, abnormalities in 3'-untranslated region struc- 
ture that interfere with normal polyadenylation may also 
lead to reduced survival of transcripts [17]. A reduction in 
steady-state mRNA levels of disease genes cannot be 
assumed, however, because the competence of a transcript 
to undergo nonsense-mediated decay is variable and some 
mutations may result in exon skipping [18,19], as has been 
shown by Liu et al. [20] for the BRCA1 gene. This strategy 
also requires a sufficient number of samples from cells or 
tissues affected by the disease to help optimize the down- 
stream data analysis. 

Although obvious, an additional requirement of expres- 
sion-based strategies is that the target gene is actually 
represented on the microarrays used. Although arrays of 
more than 10,000 genes are commonplace and complete 
genome microarrays can be anticipated, they are not yet 
routinely available. It is also probably unrealistic to assume 
that only a single gene or a few genes will stand out from 
the crowd with sufficient clarity to allow easy candidate 
selection. More likely, a strategy combining positional 
information with expression information will be necessary. 

This combination of approaches has been used by Lawn 
et al. [21**] in the discovery of the Tangier disease (TD) 
gene ABC1. Microarray analysis led to the generation of a list 
of 175 cDNAs underexpressed by 2.5-fold or more in the 
fibroblasts of an affected individual. By combining these data 
with linkage information that localized the disease gene to 
chromosome 9q between the markers WI-14706 and 
WI-4062, the candidate list was narrowed sufficiently to 
identify the gene ABC1, which did indeed carry mutations. 
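
The logic of intersecting an expression threshold with a linkage interval can be sketched in a few lines of Python. This is an illustration only and not a reconstruction of the analysis performed by Lawn and colleagues: the probe names, map positions, interval boundaries and intensities below are hypothetical, and only the 2.5-fold cutoff is taken from the text.

```python
from typing import NamedTuple


class Probe(NamedTuple):
    name: str
    chrom: str
    position_mb: float  # hypothetical map position in megabases
    control: float      # signal in unaffected cells
    patient: float      # signal in affected cells (assumed > 0)


def fold_reduction(probe):
    """Fold underexpression in the affected sample relative to control."""
    return probe.control / probe.patient


def positional_candidates(probes, chrom, start_mb, end_mb, min_fold=2.5):
    """Keep probes that are underexpressed AND map inside the linked interval."""
    return [
        p.name
        for p in probes
        if fold_reduction(p) >= min_fold
        and p.chrom == chrom
        and start_mb <= p.position_mb <= end_mb
    ]


# Hypothetical probes; none corresponds to a real clone or marker.
probes = [
    Probe("cDNA_1", "9", 104.0, 1000.0, 300.0),  # underexpressed, in interval
    Probe("cDNA_2", "9", 105.5, 800.0, 700.0),   # in interval, not underexpressed
    Probe("cDNA_3", "2", 50.0, 900.0, 200.0),    # underexpressed, wrong chromosome
    Probe("cDNA_4", "9", 20.0, 600.0, 100.0),    # underexpressed, outside interval
]

hits = positional_candidates(probes, chrom="9", start_mb=100.0, end_mb=110.0)
print(hits)
```

The expression filter alone leaves three of the four hypothetical probes, whereas the positional filter reduces the list to one, which is the effect described above.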

Notably, Lawn et al. [21**] used commercial cDNA arrays 
containing 58,800 cDNAs, which presumably provided a 
reasonably thorough genome scan. One might imagine that 
regional searches could be made by constructing targeted 
microarrays covering a particular candidate region. This 
has been done for the X chromosome and for chromosome 
17q [9*,22**]. 

It is important to bear in mind that almost all research 
employing microarray expression analysis depends heavily 
on statistical analysis to extract the most useful information 
from the huge number of data points generated. This means 
that any investigator attempting to use microarrays for 
disease gene discovery will seek to go beyond this direct 
type of search and also examine the broader effects of muta- 
tion on gene expression in samples from affected individuals. 

If one were not able to identify a candidate gene easily by 
virtue of its underexpression, perhaps the recognition of 
pathways altered consistently across a set of specimens might 
lead to the identification of good candidate genes or, at the 
very least, might illuminate some aspects of pathogenesis. 

Finding the disease pathway affected by 
known genes 

The complexity of microarray data is illustrated by anoth- 
er interesting feature of the TD data — the overexpression 
of 375 cDNAs by 2.5-fold or more. This result, revealing a 
total of 550 cDNAs with altered expression, is probably 
typical of what might be expected in most projects. In 
addition to innumerable technical factors, variations in 
gene expression across samples might be due to random 
fluctuations or confounding variables such as age, sex, site of 
sample and irrelevant genetic variations. Still, it would 
seem reasonable to suppose that the presence of a muta- 
tion in a pathway might frequently lead to secondary 
events affecting the level of expression of many other 
genes functionally connected to the disease gene. 

Most published examples attempting to place genes from 
microarray data on samples carrying mutations into coherent 
pathways are in the setting of model systems for which the 
mutation is already known. McNeish et al. [23*] have exam- 
ined a mouse model of TD with microarrays containing 
11,000 genes and have identified 131 genes with greater 
than 1.8-fold differential regulation, many of which can be 
grouped into a few function-related categories. Their study 
demonstrates how studies of a relatively tractable experi- 
mental model can enhance the value of data obtained from 
human samples. 

Likewise, Soukas et al. [24*] examined gene expression in 
white adipose tissue from mice expressing varying levels of 
the leptin gene. Seventy-seven genes were dysregulated by 
threefold or more in these ob/ob mice, including a number of 
key genes in fat metabolism. One cluster of genes was coor- 
dinately regulated by SREBP-1/ADD1, but the regulating 
mechanisms linking genes in several other clusters remain 
unknown. Although the complete pattern of changes 
observed cannot be explained as yet, the relevance of the 
leptin gene to fat metabolism is amply demonstrated. 

Simbulan-Rosenthal et al. [25*] examined fibroblasts from 
mice deficient in poly(ADP-ribose) polymerase (PARP) 
with microarrays covering 11,000 genes and identified 91 
genes differentially regulated by at least twofold relative to 
wild-type fibroblasts. About 40% of these could be related 
to either the cell cycle or remodeling of the cytoskeleton or 
extracellular matrix — processes known to be associated 
with PARP function. 

Callow et al. [26] examined livers from apolipoprotein AI 
knockout mice, scavenger receptor BI transgenic mice and 
wild-type mice on microarrays containing 5600 cDNAs. 
They used t-test statistics to identify a small number of 
genes that differed significantly across these conditions. 
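
A per-gene t statistic of this kind can be computed with the Python standard library alone. The sketch below uses Welch's unequal-variance form, which is an assumption made here for illustration; the source states only that t-test statistics were used. The gene names and log-ratio values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups of replicate measurements."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical log ratios: (mutant replicates, wild-type replicates)
data = {
    "gene_X": ([2.0, 2.2, 1.8], [0.1, -0.1, 0.0]),
    "gene_Y": ([0.5, -0.4, 0.1], [0.2, -0.3, 0.3]),
}

# Rank genes by the absolute size of the statistic, largest first.
ranked = sorted(
    ((gene, welch_t(a, b)) for gene, (a, b) in data.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for gene, t in ranked:
    print(f"{gene}: t = {t:.2f}")
```

Because thousands of genes are tested at once, any significance cutoff applied to such statistics has to allow for multiple comparisons.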

For disease gene discovery, the interpretation of expres- 
sion data in terms of pathways is more difficult because 
there is no a priori knowledge of the disease gene function. 
This leads to a consideration of the process of grouping dif- 
ferentially expressed genes into pathways. 

Placing genes in pathways to gain clues about 
unknown genes 

Can pathways actually be discerned from microarray data? 
It is worthwhile considering some of the individual steps in 
the process of deducing pathway information from these 
data. Clustering of genes into co-regulated groups is com- 
putationally straightforward and readily generates this type 
of information [27]. Similarly, there has been great success 
in classifying biological samples from microarray data, 
particularly for cancer specimens [28*,29**-31**,32]. These 
studies are promising in identifying critical genes for can- 
cer progression at the expression level, although these are 
not necessarily 'disease genes' in the genetic sense [33**]. 
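
To make the idea of grouping co-regulated genes concrete, the following Python sketch links genes whose expression profiles are highly correlated across samples. It is a simplified threshold-based single-linkage grouping, not the hierarchical clustering procedure of the cited work. The profiles and the 0.9 correlation cutoff are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither profile is constant


def group_coregulated(profiles, threshold=0.9):
    """Place a gene in a group if it correlates with any current member."""
    clusters = []
    for gene, profile in profiles.items():
        linked = [
            c for c in clusters
            if any(pearson(profile, profiles[g]) >= threshold for g in c)
        ]
        merged = [gene]
        for c in linked:  # merge every group the new gene links to
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


# Hypothetical expression values for four genes across four samples.
profiles = {
    "gene_1": [1.0, 2.0, 3.0, 4.0],
    "gene_2": [2.0, 4.0, 6.0, 8.5],
    "gene_3": [4.0, 3.0, 2.0, 1.0],
    "gene_4": [8.0, 6.0, 4.5, 2.0],
}

clusters = group_coregulated(profiles)
print(clusters)
```

Transposing the data, so that each profile is a sample measured across genes, applies the same procedure to the classification of specimens.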

Nonetheless, Hedenfalk et al. [34**] have even shown that it 
is possible to sort breast cancer specimens according to the 
presence of hereditary mutations in BRCA1 or BRCA2. One 
of the most striking results in their study was the demon- 
stration that a sample that clustered with those from patients 
carrying mutations in BRCA1 lacked a BRCA1 mutation but 
was highly methylated at the BRCA1 promoter. 

It might be hoped that this approach could aid complex 
disease gene discovery by sorting samples into groups that 
share a common genetic defect. When combined with 
positional data from linkage analysis, such an approach 
might be expected to take on a significant role in the study 
of complex disease. 

In contrast to clustering samples and genes, the interpreta- 
tion of expression data to infer the pathway affected by a 
disease gene mutation is much more problematic. The ini- 
tial problem one faces in this type of analysis is the limited 
annotation of the genome. When examining an expression 
database, one immediately encounters difficulty in placing 
genes into functional categories. This task is beset with a num- 
ber of obstacles, the first of which are the numerous aliases 
that confuse gene nomenclature. 

The introduction of two on-line resources, LocusLink and 
RefSeq, has gone a long way towards overcoming this prob- 
lem by providing a unique identifier and curated sequence 
for each gene [35]. This is absolutely critical to the next 
phase of analysis, which is the cross-reference to other data- 
bases of gene function including, most importantly, literature 
databases. Frequently, different functions or interpretations 
of gene function are linked to distinct aliases for a given 
gene. Only by thoroughly combing the literature can the 
most comprehensive picture of gene function be obtained. 
Substantial efforts are being made to organize the genes of 
known function into meaningful categories. 

Although a detailed discussion of the problem of gene anno- 
tation is beyond the scope of this review, the public 
availability of certain resources should be noted. In particu- 
lar, the Gene Ontology Consortium uses a common language 
to organize functional information in all species [36]. 
Currently, the Gene Ontology database contains database 
links for Drosophila, S. cerevisiae, mouse and Caenorhabditis 
elegans. Genes are categorized in three hierarchical schemes 
according to molecular function, biological process and 
cellular component. 
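
As an illustration of how the three schemes might be used on a list of differentially expressed genes, the Python sketch below groups genes by their annotation in one chosen scheme. The gene names and term assignments are hypothetical, and each gene is given a single term per scheme for simplicity, whereas in the real ontology a gene may carry several terms arranged in a hierarchy.

```python
from collections import defaultdict

# Hypothetical annotations, invented for illustration only.
ANNOTATIONS = {
    "gene_1": {
        "molecular_function": "transporter activity",
        "biological_process": "lipid transport",
        "cellular_component": "plasma membrane",
    },
    "gene_2": {
        "molecular_function": "ATP binding",
        "biological_process": "lipid transport",
        "cellular_component": "plasma membrane",
    },
    "gene_3": {
        "molecular_function": "DNA binding",
        "biological_process": "regulation of transcription",
        "cellular_component": "nucleus",
    },
}


def group_by_aspect(genes, aspect):
    """Group genes by their term in one scheme; unknown genes are kept visible."""
    groups = defaultdict(list)
    for gene in genes:
        term = ANNOTATIONS.get(gene, {}).get(aspect, "unannotated")
        groups[term].append(gene)
    return dict(groups)


differential = ["gene_1", "gene_2", "gene_3", "gene_4"]
by_process = group_by_aspect(differential, "biological_process")
print(by_process)
```

Keeping an explicit "unannotated" group makes the limited annotation of the genome visible in the output rather than silently dropping such genes.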

Methods to process groups of genes with respect to 
literature databases are also under development [37-39]. 
One system, High-density Array Pattern Interpreter 
(HAPI; http://array.ucsd.edu/hapi/), is publicly available. 
It is anticipated that search engines that can carry out 
these computations with the output of expression data- 
bases will significantly accelerate the process of organizing 

Although it is relatively straightforward to identify lists of 
genes that are co-regulated across a set of samples, this 
may not be a sufficiently sensitive method to extract func- 
tionally related genes. Intensive efforts to establish 
alternate computational methods are continuing. 

Seungchan et al. [40] have described a multivariate tech- 
nique that has the potential to identify relationships 
among genes that are refractory to methods based on linear 
correlation. Akutsu et al. [41] have proposed a method for 
modeling gene expression in terms of Boolean networks, 
whereas Friedman et al. [42] have proposed a Bayesian 
method. Hastie et al. [43] have described a method termed 
'gene shaving', which differs from hierarchical clustering in 
that genes may belong to more than one cluster. Brown 
et al. [44] have advocated the use of a method based on the 
theory of 'support vector machines', a computer learning 
method that they have adapted to the functional catego- 
rization of expression data. 
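
The Boolean-network view can be illustrated with a toy example in which each gene is either on (1) or off (0) and its next state is a Boolean function of two regulators. The Python sketch below performs a brute-force search for every pair of regulators and truth table consistent with a set of observed state transitions. It is an assumption-laden illustration in the spirit of this approach, not the published algorithm of any of the groups cited, and the three-gene data set is hypothetical.

```python
from itertools import combinations, product


def consistent_rules(observations, target, n_genes):
    """Find (regulator pair, truth table) rules that explain the target gene.

    observations: list of (state_t, state_t_plus_1) tuples of 0/1 values.
    table[2 * x + y] is the predicted next state for regulator values (x, y).
    """
    rules = []
    for inputs in combinations(range(n_genes), 2):
        for table in product((0, 1), repeat=4):
            if all(
                table[2 * state[inputs[0]] + state[inputs[1]]] == nxt[target]
                for state, nxt in observations
            ):
                rules.append((inputs, table))
    return rules


# Hypothetical data for genes A, B, C (indices 0, 1, 2), generated from the
# rule "C is on at the next step only if A and B are both on now".
observations = [
    (state, (state[0], state[1], state[0] & state[1]))
    for state in product((0, 1), repeat=3)
]

rules = consistent_rules(observations, target=2, n_genes=3)
print(rules)
```

With all eight starting states observed, only the pair (A, B) with the AND truth table survives. With fewer observations, several rules typically remain consistent, which is one reason such methods need many expression measurements.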

Finding regulatory motifs 

One great challenge remaining in the analysis of mam- 
malian expression data will be to link this information to 
regulatory elements in the genome sequence. Promising 
results in yeast continue to appear. Iyer et al. [45**] have 
taken advantage of the small size of the yeast genome to 
array non-coding DNA and identify the genes regulated 
by the cell-cycle transcription factors SBF and MBF. Ren 
et al. [46**] have achieved similar results for Gal4 and 
Ste12, and Livesey et al. [47] have identified the response 
element configuration and genes responsive to the mouse 
homeobox gene Crx. 

The development of progressively more sophisticated 
computational methods increases optimism that genes 
related to a phenotype can be accurately extracted and 
placed in functionally related groups to help generate 
new hypotheses. Even with this goal accomplished, one 
would expect that the effects of mutation on one bio- 
chemical pathway will radiate to affect numerous other 
pathways. Identifying the pathway primarily affected will 
be a significant challenge. 

Using microarrays to map genomic DNA 

Although using microarrays to identify regions of copy- 
number change in cancers has received the most 
attention, array format CGH might also be useful for 
mapping hereditary disease genes. Bruder et al. [48**] 
have used microarrays tiled across a 7-Mb region includ- 
ing the neurofibromatosis type 2 gene (NF2) to analyze 
DNA from 116 NF2 patients. Using this exquisitely accu- 
rate system, they were able to identify 24 patients with 
gene deletions and show that there was no correlation 
with disease severity. 

In principle, this type of approach could be applied to a 
region containing an unknown disease gene. Because posi- 
tional cloners frequently assemble contigs covering regions 
of linkage, the availability of genomic clones may not be 
problematic. However, the technology for arraying and 
accurately determining copy number in this setting is still 
confined to a few laboratories. 

Conclusions 

Unquestionably, large-scale expression analysis is now 
established in the study of genome function. The power of 
this approach continues to be enhanced by technical 
advances and, importantly, by the development of very 
large coherent expression databases from samples collect- 
ed across a broad range of conditions [49**]. The recent 
report from Shoemaker et al. [50] points to the future with 
microarrays composed of over one million oligonucleotides 
representing 442,785 exons predicted from the draft 
human genome sequence. These developments suggest 
that microarray analysis will increasingly merit considera- 
tion as an ancillary technique to facilitate hereditary 
disease gene discovery. 

Update 

Loftus and Pavan have recently used melanocyte-specific 
microarrays to identify a mouse coat color gene (S Loftus, 
W Pavan, personal communication). 